Explainable cluster analysis: a bagging approach
Quetti, Federico Maria, Ballante, Elena, Figini, Silvia, Giudici, Paolo
A major limitation of clustering approaches is their lack of explainability: methods rarely provide insight into which features drive the grouping of similar observations. To address this limitation, we propose an ensemble-based clustering framework that integrates bagging and feature dropout to generate feature importance scores, in analogy with feature importance mechanisms in supervised random forests. By leveraging multiple bootstrap resampling schemes and aggregating the resulting partitions, the method improves stability and robustness of the cluster definition, particularly in small-sample or noisy settings. Feature importance is assessed through an information-theoretic approach: at each step, the mutual information between each feature and the estimated cluster labels is computed and weighted by a measure of clustering validity to emphasize well-formed partitions, before being aggregated into a final score. The method outputs both a consensus partition and a corresponding measure of feature importance, enabling a unified interpretation of clustering structure and variable relevance. Its effectiveness is demonstrated on multiple simulated and real-world datasets.
- Europe > Italy (0.04)
- North America > United States > New Jersey > Hudson County > Hoboken (0.04)
- North America > United States > Wisconsin (0.04)
- (4 more...)
- North America > United States > California > Los Angeles County > Los Angeles (0.29)
- Europe > Austria > Vienna (0.14)
- Oceania > Australia > New South Wales > Sydney (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
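The bagging-plus-dropout importance scheme described in the abstract can be sketched in a few lines. The choices below (KMeans as the base clusterer, silhouette as the validity weight, quartile binning before mutual information) are our assumptions, not the authors' exact configuration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import mutual_info_score, silhouette_score

def bagged_feature_importance(X, n_clusters=3, n_rounds=20, drop_frac=0.25, seed=0):
    """Ensemble clustering importance sketch: bootstrap rows, drop a random
    subset of features, cluster, then weight each kept feature's mutual
    information with the labels by the partition's silhouette score."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    scores = np.zeros(d)
    counts = np.zeros(d)
    for _ in range(n_rounds):
        rows = rng.choice(n, size=n, replace=True)
        keep = rng.choice(d, size=max(2, int(d * (1 - drop_frac))), replace=False)
        Xb = X[np.ix_(rows, keep)]
        labels = KMeans(n_clusters=n_clusters, n_init=5, random_state=0).fit_predict(Xb)
        validity = max(silhouette_score(Xb, labels), 0.0)  # clamp: emphasize well-formed partitions
        for j in keep:
            # discretize the feature so mutual information is well defined
            binned = np.digitize(X[rows, j], np.quantile(X[rows, j], [0.25, 0.5, 0.75]))
            scores[j] += validity * mutual_info_score(binned, labels)
            counts[j] += 1
    return scores / np.maximum(counts, 1)  # average over the rounds each feature survived
```

On data with two informative and two noise features, the informative ones should accumulate clearly higher scores.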
Wittgenstein's Family Resemblance Clustering Algorithm
Amanpour, Golbahar, Ghojogh, Benyamin
This paper, introducing a novel method in philo-matics, draws on Wittgenstein's concept of family resemblance from analytic philosophy to develop a clustering algorithm for machine learning. According to Wittgenstein's Philosophical Investigations (1953), family resemblance holds that members of a concept or category are connected by overlapping similarities rather than a single defining property. Consequently, a family of entities forms a chain of items sharing overlapping traits. This philosophical idea naturally lends itself to a graph-based approach in machine learning. Accordingly, we propose the Wittgenstein's Family Resemblance (WFR) clustering algorithm and its kernel variant, kernel WFR. This algorithm computes resemblance scores between neighboring data instances, and after thresholding these scores, a resemblance graph is constructed. The connected components of this graph define the resulting clusters. Simulations on benchmark datasets demonstrate that WFR is an effective nonlinear clustering algorithm that does not require prior knowledge of the number of clusters or assumptions about their shapes.
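One minimal reading of the WFR recipe (resemblance scores between neighboring instances, thresholding, clusters as connected components) might look like the following; the RBF resemblance score and the fixed neighborhood size are illustrative assumptions, not the authors' exact definitions:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from sklearn.neighbors import NearestNeighbors

def wfr_clusters(X, k=5, threshold=0.3, gamma=1.0):
    """Illustrative family-resemblance clustering: link each point to the
    k nearest neighbors whose RBF resemblance exceeds a threshold, then
    read the clusters off the connected components of that graph."""
    n = X.shape[0]
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, idx = nn.kneighbors(X)
    rows, cols = [], []
    for i in range(n):
        for d, j in zip(dist[i, 1:], idx[i, 1:]):  # skip the self-neighbor
            if np.exp(-gamma * d * d) > threshold:  # keep edges with high resemblance
                rows.append(i)
                cols.append(j)
    graph = csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(n, n))
    n_comp, labels = connected_components(graph, directed=False)
    return n_comp, labels
```

Note that, as in the paper's description, the number of clusters falls out of the graph rather than being specified up front.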
In-Context Clustering with Large Language Models
Wang, Ying, Ren, Mengye, Wilson, Andrew Gordon
We propose In-Context Clustering (ICC), a flexible LLM-based procedure for clustering data from diverse distributions. Unlike traditional clustering algorithms constrained by predefined similarity measures, ICC flexibly captures complex relationships among inputs through an attention mechanism. We show that pretrained LLMs exhibit impressive zero-shot clustering capabilities on text-encoded numeric data, with attention matrices showing salient cluster patterns. Spectral clustering using attention matrices offers surprisingly competitive performance. We further enhance the clustering capabilities of LLMs on numeric and image data through fine-tuning using the Next Token Prediction (NTP) loss. Moreover, the flexibility of LLM prompting enables text-conditioned image clustering, a capability that classical clustering methods lack. Our work extends in-context learning to an unsupervised setting, showcasing the effectiveness and flexibility of LLMs for clustering. Our code is available at https://agenticlearning.ai/icc.
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.96)
- North America > United States > Virginia (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- North America > United States > California > Yolo County > Davis (0.14)
- North America > Canada > Ontario > Toronto (0.04)
- Information Technology > Data Science > Data Mining (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.93)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)
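The observation that attention matrices can serve as affinities is straightforward to try: symmetrize a pooled attention map and hand it to spectral clustering with a precomputed affinity. The pooling and symmetrization choices below are ours, not necessarily the paper's:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_from_attention(attn, n_clusters):
    """Use an (n_items, n_items) attention map, pooled over heads/layers,
    as the affinity matrix for spectral clustering."""
    affinity = (attn + attn.T) / 2.0          # spectral clustering expects symmetry
    np.fill_diagonal(affinity, affinity.max())  # self-affinity should be maximal
    sc = SpectralClustering(n_clusters=n_clusters, affinity="precomputed",
                            assign_labels="kmeans", random_state=0)
    return sc.fit_predict(affinity)
```

On an attention map with salient block structure, the recovered labels follow the blocks.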
Appendix for " Label consistency in overfitted generalized k-means "
Figure S1(a) shows a noisy circle-torus model. To show the result, it is enough to use Theorem 2 with properly chosen (fake) centers on this dataset. The code is provided as a ZIP file in the supplementary material. True clusters are distinguished by their color. We have the following extension of Proposition 3, stated as Proposition S1.
- North America > United States > California > Los Angeles County > Los Angeles (0.29)
- Europe > Austria > Vienna (0.14)
- Oceania > Australia > New South Wales > Sydney (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
A Conditional GAN for Tabular Data Generation with Probabilistic Sampling of Latent Subspaces
Akritidis, Leonidas, Bozanis, Panayiotis
The tabular form is the standard way of representing data in relational database systems and spreadsheets. Like other data forms, however, tabular data suffers from class imbalance, a problem that causes serious performance degradation in a wide variety of machine learning tasks. One of the most effective solutions is the use of Generative Adversarial Networks (GANs) to synthesize artificial data instances for the under-represented classes. Despite their good performance, none of the proposed GAN models takes into account the vector subspaces of the input samples in the real data space, leading to data generation in arbitrary locations. Moreover, the class labels are treated in the same manner as the other categorical variables during training, so conditional sampling by class is rendered less effective. To overcome these problems, this study presents ctdGAN, a conditional GAN for alleviating class imbalance in tabular datasets. Initially, ctdGAN executes a space partitioning step to assign cluster labels to the input samples. Subsequently, it utilizes these labels to synthesize samples via a novel probabilistic sampling strategy and a new loss function that penalizes both cluster and class mis-predictions. In this way, ctdGAN is trained to generate samples in subspaces that resemble those of the original data distribution. We also introduce several other improvements, including a simple, yet effective cluster-wise scaling technique that captures multiple feature modes without affecting data dimensionality. An exhaustive evaluation of ctdGAN on 14 imbalanced datasets demonstrates its superiority in generating high-fidelity samples and improving classification accuracy.
- North America > United States (0.04)
- Europe > Greece > West Greece > Patra (0.04)
- Europe > Greece > Central Macedonia > Thessaloniki (0.04)
- Europe > France (0.04)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.34)
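The cluster-wise scaling idea, normalizing each sample against its own cluster's statistics so that multimodal features are handled without adding dimensions, can be sketched as follows; KMeans as the space-partitioning step and plain standardization are our stand-ins for ctdGAN's exact procedure:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_wise_scale(X, n_clusters=3, eps=1e-8, seed=0):
    """Partition the rows, then standardize each sample with its own
    cluster's mean and std, so each mode of a multimodal feature is
    normalized separately. Dimensionality is unchanged."""
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(X)
    Xs = np.empty_like(X, dtype=float)
    stats = {}
    for c in range(n_clusters):
        mask = labels == c
        mu, sd = X[mask].mean(axis=0), X[mask].std(axis=0) + eps
        Xs[mask] = (X[mask] - mu) / sd
        stats[c] = (mu, sd)  # kept so generated samples can be mapped back
    return Xs, labels, stats
```

After scaling, every cluster is centered and unit-scaled regardless of where its mode sat in the original feature space.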
Explaining Recovery Trajectories of Older Adults Post Lower-Limb Fracture Using Modality-wise Multiview Clustering and Large Language Models
Khan, Shehroz S., Abedi, Ali, Chu, Charlene H.
Interpreting large volumes of high-dimensional, unlabeled data in a manner that is comprehensible to humans remains a significant challenge across various domains. In unsupervised healthcare data analysis, interpreting clustered data can offer meaningful insights into patients' health outcomes, which hold direct implications for healthcare providers. This paper addresses the problem of interpreting clustered sensor data collected from older adult patients recovering from lower-limb fractures in the community. A total of 560 days of multimodal sensor data, including acceleration, step count, ambient motion, GPS location, heart rate, and sleep, alongside clinical scores, were remotely collected from patients at home. Clustering was first carried out separately for each data modality to assess the impact of feature sets extracted from each modality on patients' recovery trajectories. Then, using context-aware prompting, a large language model was employed to infer meaningful cluster labels for the clusters derived from each modality. The quality of these clusters and their corresponding labels was validated through rigorous statistical testing and visualization against clinical scores collected alongside the multimodal sensor data. The results demonstrated the statistical significance of most modality-specific cluster labels generated by the large language model with respect to clinical scores, confirming the efficacy of the proposed method for interpreting sensor data in an unsupervised manner. This unsupervised data analysis approach, relying solely on sensor data, enables clinicians to identify at-risk patients and take timely measures to improve health outcomes.
- Health & Medicine > Consumer Health (1.00)
- Health & Medicine > Therapeutic Area (0.70)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)
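The validation half of the pipeline, clustering each modality separately and then testing the resulting labels against clinical scores, can be sketched as below. Kruskal-Wallis is our stand-in for the paper's statistical tests, and the LLM labeling step is omitted:

```python
import numpy as np
from scipy.stats import kruskal
from sklearn.cluster import KMeans

def modality_clusters_vs_score(modalities, clinical_score, n_clusters=2, seed=0):
    """Cluster each modality's feature matrix on its own, then test whether
    the clinical score differs significantly across that modality's clusters.
    `modalities` maps a modality name to its (n_patients, n_features) matrix."""
    results = {}
    for name, X in modalities.items():
        labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(X)
        groups = [clinical_score[labels == c] for c in range(n_clusters)]
        stat, p = kruskal(*groups)  # nonparametric test across cluster groups
        results[name] = {"labels": labels, "p_value": p}
    return results
```

A modality whose clusters track the clinical score yields a small p-value, which is the criterion the abstract uses to call a modality's clustering meaningful.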
Lightweight Trustworthy Distributed Clustering
Li, Hongyang, Wu, Caesar, Chadli, Mohammed, Mammar, Said, Bouvry, Pascal
Ensuring data trustworthiness within individual edge nodes while facilitating collaborative data processing poses a critical challenge in edge computing systems (ECS), particularly in resource-constrained scenarios such as autonomous systems, sensor networks, industrial IoT, and smart cities. This paper presents a lightweight, fully distributed k-means clustering algorithm specifically adapted for edge environments, leveraging a distributed averaging approach with additive secret sharing, a secure multiparty computation technique, during the cluster center update phase to ensure the accuracy and trustworthiness of data across nodes. Edge computing, a paradigm emerging from distributed computing, emphasizes processing data at or near its source to minimize latency and reduce bandwidth consumption [1]-[3]. The rapid advancements in edge computing technologies, including algorithms for decentralized and efficient data processing, have significantly accelerated the deployment of distributed sensor networks. Two key properties of ECS are crucial in large-scale deployments.
- Information Technology > Security & Privacy (1.00)
- Information Technology > Communications > Networks > Sensor Networks (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.57)
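The secure center-update step, distributed averaging with additive secret sharing, can be illustrated in plain NumPy. Real protocols share values modulo a large integer rather than as floats, so this is a sketch of the arithmetic, not the paper's exact scheme:

```python
import numpy as np

def additive_shares(value, n_parties, rng):
    """Split `value` into n additive shares that sum to `value`.
    Each individual share is random noise and reveals nothing alone."""
    shares = rng.normal(0, 100.0, size=(n_parties - 1,) + np.shape(value))
    last = value - shares.sum(axis=0)
    return list(shares) + [last]

def secure_center_update(local_sums, local_counts, rng):
    """One cluster-center update without a trusted aggregator: every node
    secret-shares its per-cluster sum (shape (k, d)) and count (shape (k,));
    only share totals are combined, so no node's raw statistics leak."""
    n = len(local_sums)
    sum_shares = [additive_shares(s, n, rng) for s in local_sums]
    cnt_shares = [additive_shares(np.asarray(c, float), n, rng) for c in local_counts]
    # party j receives one share from every node and publishes only the total
    sum_totals = [sum(sh[j] for sh in sum_shares) for j in range(n)]
    cnt_totals = [sum(sh[j] for sh in cnt_shares) for j in range(n)]
    global_sum = np.sum(sum_totals, axis=0)
    global_cnt = np.sum(cnt_totals, axis=0)
    return global_sum / np.maximum(global_cnt[..., None], 1)
```

The published share totals reconstruct exactly the global sums and counts, so the resulting centers match a centralized k-means update while each node's local statistics stay hidden.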